Abstract: We consider the problem of reconstructing a low
rank matrix from noisy observations of a subset of its entries. 
This task has applications in statistical learning, computer vision, 
and signal processing. In these contexts, 'noise' generically refers 
to any contribution to the data that is not captured by the 
low-rank model. In most applications, the noise level is large 
compared to the underlying signal and it is important to avoid 
overfitting. In order to tackle this problem, we define a regularized
cost function well suited for spectral reconstruction methods. 
Within a random noise model, and in the large system limit, we 
prove that the resulting accuracy undergoes a phase transition 
depending on the noise level and on the fraction of observed 
entries. The cost function can be minimized using OptSpace 
(a manifold gradient descent algorithm). Numerical simulations 
show that this approach is competitive with state-of-the-art 
alternatives. 

I. Introduction 

Let $N$ be an $m \times n$ matrix which is 'approximately' low
rank, that is

$$N = M + W = U \Sigma V^T + W, \tag{1}$$

where $U$ has dimensions $m \times r$, $V$ has dimensions $n \times r$,
and $\Sigma$ is a diagonal $r \times r$ matrix. Thus $M$ has rank $r$ and
$W$ can be thought of as noise, or 'unexplained contributions' to $N$.
Throughout the paper we assume the normalization
$U^T U = m I_{r\times r}$ and $V^T V = n I_{r\times r}$
($I_{d\times d}$ being the $d \times d$ identity).

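Out of the $m \times n$ entries of $N$, a subset $E \subseteq [m] \times [n]$
is observed. We let $\mathcal{P}_E(N)$ be the $m \times n$ matrix that
contains the observed entries of $N$, and is filled with 0's in the other
positions

$$\mathcal{P}_E(N)_{ij} = \begin{cases} N_{ij} & \text{if } (i,j) \in E,\\ 0 & \text{otherwise.} \end{cases} \tag{2}$$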
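As a concrete illustration of the projection in Eq. (2), the following minimal sketch represents the set $E$ as a boolean mask. The function names `project_onto_observed` and `sample_mask` are hypothetical helpers introduced here for illustration only; they are not part of OptSpace.

```python
import numpy as np


def project_onto_observed(N, mask):
    """Return P_E(N): entries with mask == True are kept, the rest are set to 0."""
    N = np.asarray(N, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if N.shape != mask.shape:
        raise ValueError("N and mask must have the same shape")
    return np.where(mask, N, 0.0)


def sample_mask(m, n, p, seed=0):
    """Hypothetical helper: reveal each entry independently with probability p."""
    rng = np.random.default_rng(seed)
    return rng.random((m, n)) < p


# Small worked example: a 2 x 3 matrix with three observed entries.
N = np.arange(1.0, 7.0).reshape(2, 3)
mask = np.array([[True, False, True], [False, True, False]])
N_E = project_onto_observed(N, mask)  # [[1, 0, 3], [0, 5, 0]]
```

The mask representation is only one possible encoding of $E$; for very sparse observation sets a list of index pairs or a sparse matrix would be more economical.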



The noisy matrix completion problem requires reconstructing
the low rank matrix $M$ from the observations $\mathcal{P}_E(N)$. In the
following we will also write $N^E = \mathcal{P}_E(N)$ for the sparsified
matrix. Over the last year, matrix completion has attracted
significant attention because of its relevance, among other
applications, to collaborative filtering. In this case, the matrix
$N$ contains evaluations of a group of customers on a group of
products, and one is interested in exploiting a sparsely filled
matrix to provide personalized recommendations [1].

In such applications, the noise $W$ is not a small perturbation
and it is crucial to avoid overfitting. For instance, in the limit
$M \to 0$, the estimate of $M$ risks being a low-rank approximation
of the noise $W$, which would be grossly incorrect.

In order to overcome this problem, we propose in this paper 
an algorithm based on minimizing the following cost function 
$$F_E(X, Y; S) \equiv \frac{1}{2}\,\|\mathcal{P}_E(N - X S Y^T)\|_F^2 + \frac{1}{2}\,\lambda\,\|S\|_F^2 . \tag{3}$$

Here the minimization variables are $S \in \mathbb{R}^{r\times r}$, and
$X \in \mathbb{R}^{m\times r}$, $Y \in \mathbb{R}^{n\times r}$ with
$X^T X = Y^T Y = I_{r\times r}$. Finally, $\lambda > 0$
is a regularization parameter.
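To make the roles of the two terms explicit, the cost (3) can be evaluated as in the sketch below. It assumes the observed set is stored as a boolean mask and that unobserved entries of `N_E` are zero; the name `regularized_cost` is ours and purely illustrative.

```python
import numpy as np


def regularized_cost(N_E, mask, X, Y, S, lam):
    """Evaluate 0.5*||P_E(N - X S Y^T)||_F^2 + 0.5*lam*||S||_F^2."""
    # Only observed entries contribute to the data-fit term.
    residual = np.where(mask, N_E - X @ S @ Y.T, 0.0)
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * np.sum(S ** 2)


# Tiny example with m = 3, n = 2, r = 1.
X = np.array([[1.0], [0.0], [0.0]])
Y = np.array([[1.0], [0.0]])
S = np.array([[2.0]])
N_E = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
mask = np.array([[True, False], [False, True], [False, False]])
cost = regularized_cost(N_E, mask, X, Y, S, lam=0.5)
```

In this example the two observed residuals are both equal to 1, so the data-fit term is 1, and the penalty $\frac{1}{2}\cdot 0.5\cdot 2^2$ contributes another 1.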

A. Algorithm and main results 

The algorithm is an adaptation of the OptSpace algorithm
developed in [2]. A key observation is that the following
modified cost function can be minimized by singular value
decomposition (see Section III):

$$\hat F_E(X, Y; S) \equiv \frac{1}{2}\,\|\mathcal{P}_E(N) - X S Y^T\|_F^2 + \frac{1}{2}\,\lambda\,\|S\|_F^2 . \tag{4}$$

As emphasized in [2], [3], which analyzed the case $\lambda = 0$,
this minimization can yield poor results unless the set of
observations $E$ is 'well balanced'. This problem can be
bypassed by 'trimming' the set $E$, and constructing a balanced
set $\tilde E$. The OptSpace algorithm is given as follows.

OptSpace( set $E$, matrix $N^E$ )

1: Trim $E$, and let $\tilde E$ be the output;

2: Minimize $\hat F_{\tilde E}(X, Y; S)$ via SVD,
   let $X_0$, $Y_0$, $S_0$ be the output;

3: Minimize $F_E(X, Y; S)$ by gradient descent
   using $X_0$, $Y_0$, $S_0$ as initial condition.
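Step 2 admits a closed form, which the sketch below illustrates. For fixed orthonormal $X$, $Y$ and a matrix $A$, the modified cost equals $\frac{1}{2}\|A\|_F^2 - \langle X^T A Y, S\rangle + \frac{1}{2}(1+\lambda)\|S\|_F^2$, so the optimal $S$ is $X^T A Y/(1+\lambda)$, and the remaining maximization of $\|X^T A Y\|_F$ is solved by the top $r$ singular vectors of $A$. The code assumes no trimming and no rescaling of $N^E$; the function names are hypothetical and this is not the reference OptSpace implementation.

```python
import numpy as np


def regularized_svd_step(N_E, r, lam):
    """Minimize 0.5*||N_E - X S Y^T||_F^2 + 0.5*lam*||S||_F^2 over orthonormal X, Y."""
    Xf, s, Yt = np.linalg.svd(N_E, full_matrices=False)
    X0, Y0 = Xf[:, :r], Yt[:r].T
    # First-order condition in S: the top singular values are shrunk by 1/(1 + lam).
    S0 = np.diag(s[:r]) / (1.0 + lam)
    return X0, Y0, S0


def modified_cost(N_E, X, Y, S, lam):
    """The modified cost: the projection acts on N only, not on X S Y^T."""
    return 0.5 * np.sum((N_E - X @ S @ Y.T) ** 2) + 0.5 * lam * np.sum(S ** 2)


rng = np.random.default_rng(1)
A = rng.standard_normal((6, 5))
X0, Y0, S0 = regularized_svd_step(A, r=2, lam=0.5)
best = modified_cost(A, X0, Y0, S0, lam=0.5)
```

The output of this step would then serve as the initial condition for the gradient descent of step 3, which is not sketched here.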

In this paper we will study this algorithm under a model
for which step 1 (trimming) is never called, i.e. $\tilde E = E$ with
high probability. We will therefore not discuss it any further.
Section II compares the behavior of the present approach
with alternative schemes. Our main analytical result is a sharp
characterization of the mean square error after step 2. Here
and below the limit $n \to \infty$ is understood to be taken with
$m/n \to \alpha \in (0, \infty)$.

Theorem I.1. Assume $|M_{ij}| \le M_{\max}$, $W_{ij}$ to be i.i.d. random
variables with mean 0, variance $\sqrt{mn}\,\sigma^2$ and
$E\{W_{ij}^4\} \le C n^2$, and that each entry $N_{ij}$ is observed
(i.e. $(i,j) \in E$) independently with probability $p$. Finally let
$\hat M = X_0 S_0 Y_0^T$ be the rank $r$ matrix reconstructed by step 2
of OptSpace, for the optimal choice of $\lambda$. Then, almost surely
for $n \to \infty$

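$$\frac{1}{\|M\|_F^2}\,\|\hat M - M\|_F^2 = 1 - \frac{\Big\{\sum_{k=1}^{r} \Sigma_k^2 \Big(1 - \frac{\sigma^4}{p^2 \Sigma_k^4}\Big)_+\Big\}^2}{\|\Sigma\|_F^2\,\Big\{\sum_{k=1}^{r} \Sigma_k^2 \Big(1 + \frac{\sqrt{\alpha}\,\sigma^2}{p \Sigma_k^2}\Big)\Big(1 + \frac{\sigma^2}{p \Sigma_k^2 \sqrt{\alpha}}\Big)\Big\}} + o_n(1),$$

where $(x)_+ \equiv \max(x, 0)$.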



This theorem focuses on a high-noise regime, and predicts
a sharp phase transition: if $\sigma^2/p < \Sigma_1^2$, we can successfully
extract information on $M$ from the observations $N^E$. If on
the other hand $\sigma^2/p > \Sigma_1^2$, the observations are essentially
useless in reconstructing $M$. It is possible to prove that the
resulting tradeoff between noise and observed entries is tight:
no algorithm can obtain relative mean square error smaller
than one for $\sigma^2/p > \Sigma_1^2$, under a simple random model for
$M$. To the best of our knowledge, this is the first sharp phase
transition result for low rank matrix completion.

For the proof of Theorem I.1 we refer to Section III. An
important byproduct of the proof is that it provides a rule for
choosing the regularization parameter $\lambda$, in the large system
limit.

B. Related work 

The importance of regularization in matrix completion is
well known to practitioners. For instance, one important component
of many algorithms competing for the Netflix challenge [1]
consisted in minimizing the cost function
$H_E(X, Y) = \frac{1}{2}\|\mathcal{P}_E(N - XY^T)\|_F^2 + \frac{1}{2}\lambda\|X\|_F^2 + \frac{1}{2}\lambda\|Y\|_F^2$
(this is also known as maximum margin matrix factorization [5], [6]).
Here the minimization variables are $X \in \mathbb{R}^{m\times r}$ and
$Y \in \mathbb{R}^{n\times r}$.
Unlike in OptSpace, these matrices are not constrained to
be orthogonal, and as a consequence the problem becomes
significantly more degenerate. Notice that, in our approach,
the orthogonality constraint fixes the norms $\|X\|_F$, $\|Y\|_F$.
This motivates the use of $\|S\|_F^2$ as a regularization term.

Convex relaxations of the matrix completion problem were
recently studied in [7], [8]. As emphasized by Mazumder,
Hastie and Tibshirani [9], such nuclear norm relaxations
can be viewed as spectral regularizations of a least squares
problem. Finally, the phase transition phenomenon in Theorem
I.1 generalizes a result of Johnstone and Lu on principal
component analysis [10], and similar random matrix models
were studied in [11].

II. Numerical simulations 

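In this section, we present the results of numerical simulations
on synthetically generated matrices. The data are
generated following the recipe of [9]: sample $U \in \mathbb{R}^{m\times r}$
and $V \in \mathbb{R}^{n\times r}$ by choosing $U_{ij}$ and $V_{ij}$
independently and identically as $\mathcal{N}(0, 1)$. Sample independently
$W \in \mathbb{R}^{m\times n}$ by choosing $W_{ij}$ i.i.d. with distribution
$\mathcal{N}(0, \sigma^2\sqrt{mn})$. Set $N = UV^T + W$. We also use the
parameters chosen in [9] and define

$$\mathrm{SNR} = \sqrt{\frac{\mathrm{Var}((UV^T)_{ij})}{\mathrm{Var}(W_{ij})}}, \qquad
\mathrm{TestError} = \frac{\|\mathcal{P}_E^{\perp}(UV^T - \hat N)\|_F^2}{\|\mathcal{P}_E^{\perp}(UV^T)\|_F^2}, \qquad
\mathrm{TrainError} = \frac{\|\mathcal{P}_E(N - \hat N)\|_F^2}{\|\mathcal{P}_E(N)\|_F^2},$$

where $\hat N$ denotes the reconstructed matrix and
$\mathcal{P}_E^{\perp}(A) = A - \mathcal{P}_E(A)$.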
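The data generation and the three quantities above can be sketched as follows. This is an illustrative script under stated assumptions, not the code used for the figures: the helper `sigma_for_snr` uses the population value $\mathrm{Var}((UV^T)_{ij}) = r$ for standard normal factors, and all function names are ours.

```python
import numpy as np


def sigma_for_snr(m, n, r, snr):
    """Assumed calibration: Var((UV^T)_ij) = r, Var(W_ij) = sigma^2 * sqrt(m n)."""
    return np.sqrt(r / (snr ** 2 * np.sqrt(m * n)))


def generate_instance(m, n, r, sigma, p, seed=0):
    """Return (M, N, mask) with M = U V^T, N = M + W, and a Bernoulli(p) mask."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    # Noise entries have variance sigma^2 * sqrt(m n).
    W = rng.standard_normal((m, n)) * sigma * (m * n) ** 0.25
    mask = rng.random((m, n)) < p
    M = U @ V.T
    return M, M + W, mask


def empirical_snr(M, W):
    return np.sqrt(np.var(M) / np.var(W))


def relative_test_error(M, N_hat, mask):
    """Error on the unobserved entries, relative to the signal there."""
    unobserved = ~mask
    return np.sum((M - N_hat)[unobserved] ** 2) / np.sum(M[unobserved] ** 2)


def relative_train_error(N, N_hat, mask):
    """Error on the observed entries, relative to the observations."""
    return np.sum((N - N_hat)[mask] ** 2) / np.sum(N[mask] ** 2)


sigma = sigma_for_snr(100, 100, 10, snr=1.0)
M, N, mask = generate_instance(100, 100, 10, sigma, p=0.5, seed=1)
```

Any reconstruction $\hat N$ can then be scored with `relative_test_error(M, N_hat, mask)` and `relative_train_error(N, N_hat, mask)`; the all-zero estimate gives a value of 1 for both.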

In Figure 1 we plot the train error and test error for the
OptSpace algorithm on matrices generated as above with
$n = 100$, $r = 10$, SNR = 1 and $p = 0.5$. For comparison, we
also plot the corresponding curves for Soft-Impute, Hard-Impute
and SVT taken from [9]. In Figures 2 and 3 we
plot the same curves for different values of $r$, $p$, SNR. In these
plots, OptSpace($\lambda$) corresponds to the algorithm that minimizes
the cost (3). In particular OptSpace(0) corresponds
to the algorithm described in [2]. Further, $\lambda^* = \lambda^*(\rho)$ is the
value of the regularization parameter that minimizes the test
error while using rank $\rho$ (this can be estimated on a subset of
the data, not used for training).

Fig. 1. Test (top) and train (bottom) error vs. rank for OptSpace,
Soft-Impute, Hard-Impute and SVT. Here m = n = 100, r = 10, p = 0.5,
SNR = 1.

It is clear that regularization greatly improves the perfor- 
mance of OptSpace and makes it competitive with the best 
alternative methods. 

III. Proof of Theorem I.1

The proof of Theorem I.1 is based on the following three
steps: (i) Obtain an explicit expression for the root mean
square error in terms of right and left singular vectors of $N^E$;
(ii) Estimate the effect of the noise $W$ on the right and left
singular vectors; (iii) Estimate the effect of missing entries.
Step (ii) builds on recent estimates on the eigenvectors of large
covariance matrices [12]. In step (iii) we use the results of [2].
Step (i) is based on the following linear algebra calculation,
whose proof we omit due to space constraints (here and below
$\langle A, B\rangle = \mathrm{Tr}(AB^T)$).
Fig. 2. Test (top) and train (bottom) error vs. rank for OptSpace,
Soft-Impute, Hard-Impute and SVT. Here m = n = 100, r = 6, p = 0.5,
SNR = 1.

Fig. 3. Test (top) and train (bottom) error vs. rank for OptSpace,
Soft-Impute, Hard-Impute and SVT. Here m = n = 100, r = 5, p = 0.2,
SNR = 10.



Proposition III.1. Let $X_0 \in \mathbb{R}^{m\times r}$ and
$Y_0 \in \mathbb{R}^{n\times r}$ be the matrices whose columns are the
first $r$, left and right, singular vectors of $N^E$. Then the rank-$r$
matrix reconstructed by step 2 of OptSpace, with regularization
parameter $\lambda$, has the form $\hat M(\lambda) = X_0 S_0(\lambda) Y_0^T$.
Further, there exists $\lambda_* > 0$ such that

$$\frac{1}{mn}\,\|M - \hat M(\lambda_*)\|_F^2 = \|\Sigma\|_F^2 - \left(\frac{\langle X_0^T M Y_0,\; X_0^T N^E Y_0\rangle}{\sqrt{mn}\,\|X_0^T N^E Y_0\|_F}\right)^2 . \tag{5}$$
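The algebra behind Eq. (5) can be checked numerically. Writing $D = X_0^T N^E Y_0$ and restricting to estimates of the form $c\,X_0 D Y_0^T$, the best scalar is $c_* = \langle X_0^T M Y_0, D\rangle / \|D\|_F^2$ and the resulting error is $\|M\|_F^2 - \langle X_0^T M Y_0, D\rangle^2/\|D\|_F^2$; dividing by $mn$ and using $\|M\|_F^2 = mn\|\Sigma\|_F^2$ gives the right hand side of (5). The sketch below assumes the parametrization $S_0(\lambda) = D/(1+\lambda)$, so that $\lambda_* = 1/c_* - 1$ whenever $0 < c_* < 1$; the function name is hypothetical.

```python
import numpy as np


def optimal_shrinkage_error(M, N_E, r):
    """Return (direct, formula, c_star) for the best rescaling of the rank-r SVD of N_E."""
    Xf, s, Yt = np.linalg.svd(N_E, full_matrices=False)
    X0, Y0 = Xf[:, :r], Yt[:r].T
    D = X0.T @ N_E @ Y0                    # diagonal, holds the top r singular values
    inner = np.sum((X0.T @ M @ Y0) * D)    # <A, B> = Tr(A B^T)
    c_star = inner / np.sum(D ** 2)        # best scalar multiple of D
    direct = np.sum((M - c_star * X0 @ D @ Y0.T) ** 2)
    formula = np.sum(M ** 2) - inner ** 2 / np.sum(D ** 2)
    return direct, formula, c_star


rng = np.random.default_rng(0)
m, n, r = 60, 40, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.5
N_E = np.where(mask, M + 2.0 * rng.standard_normal((m, n)), 0.0)
direct, formula, c_star = optimal_shrinkage_error(M, N_E, r)
```

The two returned error values agree up to rounding for any input, since the identity is purely algebraic; only the sign and size of `c_star` depend on the data.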



A. The effect of noise 

In order to isolate the effect of noise, we consider the matrix
$\widetilde N = p\,U\Sigma V^T + W^E$. Throughout this section we assume that
the hypotheses of Theorem I.1 hold.

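Lemma III.2. Let $(n z_{1,n}, \dots, n z_{r,n})$ be the $r$ largest singular
values of $\widetilde N$. Then, as $n \to \infty$, $z_{i,n} \to z_i$ almost
surely, where, for $\Sigma_i^2 > \sigma^2/p$,

$$z_i = \Sigma_i \left\{ \alpha\, p^2 \left(1 + \frac{\sqrt{\alpha}\,\sigma^2}{p\,\Sigma_i^2}\right)\left(1 + \frac{\sigma^2}{\sqrt{\alpha}\,p\,\Sigma_i^2}\right) \right\}^{1/2}, \tag{6}$$

and $z_i = \sigma\sqrt{p}\,\alpha^{1/4}(1 + \sqrt{\alpha})$ for
$\Sigma_i^2 \le \sigma^2/p$.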

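Further, let $X \in \mathbb{R}^{m\times r}$ and $Y \in \mathbb{R}^{n\times r}$
be the matrices whose columns are the first $r$, left and right, singular
vectors of $\widetilde N$. Then there exists a sequence of $r \times r$
orthogonal matrices $Q_n$ such that, almost surely,
$\|\frac{1}{\sqrt{m}} U^T X - A Q_n\|_F \to 0$,
$\|\frac{1}{\sqrt{n}} V^T Y - B Q_n\|_F \to 0$ with
$A = \mathrm{diag}(a_1, \dots, a_r)$, $B = \mathrm{diag}(b_1, \dots, b_r)$ and

$$a_i^2 = \left(1 - \frac{\sigma^4}{p^2\Sigma_i^4}\right)\left(1 + \frac{\sqrt{\alpha}\,\sigma^2}{p\,\Sigma_i^2}\right)^{-1}, \qquad
b_i^2 = \left(1 - \frac{\sigma^4}{p^2\Sigma_i^4}\right)\left(1 + \frac{\sigma^2}{\sqrt{\alpha}\,p\,\Sigma_i^2}\right)^{-1}, \tag{7}$$

for $\Sigma_i^2 > \sigma^2/p$, while $a_i = b_i = 0$ otherwise.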
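A small Monte Carlo experiment gives a sanity check of the lemma in the simplest setting: rank one, $p = 1$. The sketch below assumes the limiting values take the form $z = p\Sigma\{\alpha(1+\sqrt{\alpha}s)(1+s/\sqrt{\alpha})\}^{1/2}$, $a^2 = (1-s^2)/(1+\sqrt{\alpha}s)$, $b^2 = (1-s^2)/(1+s/\sqrt{\alpha})$ with $s = \sigma^2/(p\Sigma^2)$, which is how we read Eqs. (6) and (7); these closed forms, the $\pm 1$ choice for $U$, $V$, and all function names are assumptions of this illustration.

```python
import numpy as np


def predicted_spike(Sigma, sigma, alpha, p=1.0):
    """Assumed limiting values (z, a^2, b^2) for one spike."""
    s = sigma ** 2 / (p * Sigma ** 2)
    if s >= 1.0:
        # Below the threshold: top singular value sits at the bulk edge, no overlap.
        z = sigma * np.sqrt(p) * alpha ** 0.25 * (1.0 + np.sqrt(alpha))
        return z, 0.0, 0.0
    fa = 1.0 + np.sqrt(alpha) * s
    fb = 1.0 + s / np.sqrt(alpha)
    z = p * Sigma * np.sqrt(alpha * fa * fb)
    return z, (1.0 - s ** 2) / fa, (1.0 - s ** 2) / fb


def simulate_rank_one(m, n, Sigma, sigma, seed=0):
    """Empirical (z, a^2, b^2) for N = Sigma * U V^T + W with fully observed entries."""
    rng = np.random.default_rng(seed)
    U = rng.choice([-1.0, 1.0], size=(m, 1))   # U^T U = m
    V = rng.choice([-1.0, 1.0], size=(n, 1))   # V^T V = n
    W = rng.standard_normal((m, n)) * sigma * (m * n) ** 0.25
    N = Sigma * U @ V.T + W
    Xf, s, Yt = np.linalg.svd(N, full_matrices=False)
    x, y = Xf[:, 0], Yt[0]
    return s[0] / n, (U[:, 0] @ x) ** 2 / m, (V[:, 0] @ y) ** 2 / n


z_pred, a2_pred, b2_pred = predicted_spike(2.0, 1.0, alpha=2.0)
z_emp, a2_emp, b2_emp = simulate_rank_one(1000, 500, 2.0, 1.0)
```

At finite $n$ the empirical values fluctuate around the predictions, so the comparison should be read with a tolerance of a few percent rather than as an exact match.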



Proof: Due to space limitations, we will focus here on the
case $\Sigma_1^2, \dots, \Sigma_r^2 > \sigma^2/p$. The general proof proceeds
along the same lines, and we defer it to [4].

Notice that $W^E$ is an $m \times n$ matrix with i.i.d. entries with
variance $\sqrt{mn}\,\sigma^2 p$ and fourth moment bounded by $Cn^2$. It
is therefore sufficient to prove our claim for $p = 1$ and then
rescale $\Sigma$ by $p$ and $\sigma$ by $\sqrt{p}$. We will also assume that,
without loss of generality, $m \ge n$.

Let $Z$ be an $r \times r$ diagonal matrix containing the singular values
$(n z_{1,n}, \dots, n z_{r,n})$. The singular value equations read

$$U\beta_y + WY - XZ = 0, \tag{8}$$
$$V\beta_x + W^T X - YZ = 0, \tag{9}$$

where we defined $\beta_x \equiv \Sigma U^T X$,
$\beta_y \equiv \Sigma V^T Y \in \mathbb{R}^{r\times r}$. By singular value
decomposition we can write
$W = L\,\mathrm{diag}(w_1, w_2, \dots, w_n)\,R^T$, with
$L^T L = I_{m\times m}$, $R^T R = I_{n\times n}$.

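Let $u_i^T, x_i^T, v_i^T, y_i^T \in \mathbb{R}^r$ be the $i$-th row of,
respectively, $L^T U$, $L^T X$, $R^T V$, $R^T Y$. In this basis equations
(8) and (9) read

$$\begin{aligned}
u_i^T\beta_y + w_i y_i^T - x_i^T Z &= 0, & i &\in [n],\\
u_i^T\beta_y - x_i^T Z &= 0, & i &\in [m]\setminus[n],\\
v_i^T\beta_x + w_i x_i^T - y_i^T Z &= 0, & i &\in [n].
\end{aligned}$$

These can be solved to get

$$\begin{aligned}
x_i^T &= (u_i^T\beta_y Z + w_i v_i^T\beta_x)(Z^2 - w_i^2)^{-1}, & i &\in [n],\\
x_i^T &= u_i^T\beta_y Z^{-1}, & i &\in [m]\setminus[n],\\
y_i^T &= (v_i^T\beta_x Z + w_i u_i^T\beta_y)(Z^2 - w_i^2)^{-1}, & i &\in [n].
\end{aligned} \tag{10}$$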
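Since the solution (10) is an exact algebraic identity for the singular vectors of $U\Sigma V^T + W$, it can be verified on a small random instance. The sketch below does so for $r = 2$ and $m \ge n$; the sizes, the values of $\Sigma$ and the function name are arbitrary choices made for illustration.

```python
import numpy as np


def max_residual_eq10(m=8, n=5, seed=0):
    """Largest entrywise gap between the rows x_i, y_i and their closed forms."""
    rng = np.random.default_rng(seed)
    r = 2
    U = np.linalg.qr(rng.standard_normal((m, r)))[0] * np.sqrt(m)  # U^T U = m I
    V = np.linalg.qr(rng.standard_normal((n, r)))[0] * np.sqrt(n)  # V^T V = n I
    Sigma = np.diag([3.0, 2.0])
    W = rng.standard_normal((m, n))
    N = U @ Sigma @ V.T + W
    # Top-r singular triplets of N.
    Xf, s, Yt = np.linalg.svd(N, full_matrices=False)
    X, Y, Z = Xf[:, :r], Yt[:r].T, np.diag(s[:r])
    beta_x, beta_y = Sigma @ U.T @ X, Sigma @ V.T @ Y
    # Full SVD of the noise: L is m x m, Rt = R^T is n x n.
    L, w, Rt = np.linalg.svd(W, full_matrices=True)
    u, x = L.T @ U, L.T @ X
    v, y = Rt @ V, Rt @ Y
    err = 0.0
    for i in range(m):
        if i < n:
            G = np.linalg.inv(Z @ Z - w[i] ** 2 * np.eye(r))
            x_pred = (u[i] @ beta_y @ Z + w[i] * v[i] @ beta_x) @ G
            y_pred = (v[i] @ beta_x @ Z + w[i] * u[i] @ beta_y) @ G
            err = max(err, np.abs(y_pred - y[i]).max())
        else:
            # Rows beyond n see no noise singular value.
            x_pred = u[i] @ beta_y @ np.linalg.inv(Z)
        err = max(err, np.abs(x_pred - x[i]).max())
    return err


residual = max_residual_eq10()
```

The residual should be at the level of floating point rounding, provided no singular value of $N$ coincides with one of $W$, which holds generically.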



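By definition $\Sigma^{-1}\beta_x = \sum_{i=1}^{m} u_i x_i^T$ and
$\Sigma^{-1}\beta_y = \sum_{i=1}^{n} v_i y_i^T$, whence

$$\Sigma^{-1}\beta_x = \sum_{i=1}^{n} u_i\,(u_i^T\beta_y Z + w_i v_i^T\beta_x)(Z^2 - w_i^2)^{-1} + \sum_{i=n+1}^{m} u_i u_i^T \beta_y Z^{-1}, \tag{11}$$

$$\Sigma^{-1}\beta_y = \sum_{i=1}^{n} v_i\,(v_i^T\beta_x Z + w_i u_i^T\beta_y)(Z^2 - w_i^2)^{-1}. \tag{12}$$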

Let $\lambda_i = \alpha^{1/2} w_i^2/(m^2\sigma^2)$. Then, it is a well known
fact [13] that as $n \to \infty$ the empirical law of the $\lambda_i$'s
converges weakly almost surely to the Marchenko-Pastur law, with density
$\rho(\lambda) = \alpha\sqrt{(\lambda - c_-^2)(c_+^2 - \lambda)}/(2\pi\lambda)$,
with $c_\pm = 1 \pm \alpha^{-1/2}$.
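The normalization of the $\lambda_i$'s can be checked numerically: with noise entries of variance $\sqrt{mn}\,\sigma^2$, the rescaled squared singular values should have mean close to 1 and lie essentially inside $[c_-^2, c_+^2]$. The sketch below evaluates the density and samples the $\lambda_i$'s for $\alpha = 2$; it assumes $\alpha \ge 1$ (no atom at zero) and uses hypothetical function names.

```python
import numpy as np


def mp_density(lam, alpha):
    """Marchenko-Pastur density with edges c_pm^2, where c_pm = 1 +/- alpha^(-1/2)."""
    c_minus, c_plus = 1.0 - alpha ** -0.5, 1.0 + alpha ** -0.5
    lam = np.asarray(lam, dtype=float)
    inside = (lam > c_minus ** 2) & (lam < c_plus ** 2)
    out = np.zeros_like(lam)
    li = lam[inside]
    out[inside] = alpha * np.sqrt((li - c_minus ** 2) * (c_plus ** 2 - li)) / (2 * np.pi * li)
    return out


def empirical_lambdas(m, n, sigma, seed=0):
    """Rescaled squared singular values of a Gaussian noise matrix W."""
    rng = np.random.default_rng(seed)
    alpha = m / n
    W = rng.standard_normal((m, n)) * sigma * (m * n) ** 0.25
    w = np.linalg.svd(W, compute_uv=False)
    return np.sqrt(alpha) * w ** 2 / (m ** 2 * sigma ** 2)


alpha = 2.0
grid = np.linspace((1 - alpha ** -0.5) ** 2, (1 + alpha ** -0.5) ** 2, 200001)
dens = mp_density(grid, alpha)
# Trapezoidal rule written out by hand.
total_mass = np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid))
first_moment = np.sum(0.5 * ((grid * dens)[1:] + (grid * dens)[:-1]) * np.diff(grid))
lam = empirical_lambdas(400, 200, sigma=1.0)
```

A histogram of `lam` against `mp_density` gives a quick visual confirmation of the convergence stated above.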

Let $\tilde\beta_x = \beta_x/\sqrt{m}$, $\tilde\beta_y = \beta_y/\sqrt{n}$,
$\tilde Z = Z/n$. A priori, it is not clear that the sequence
$(\tilde\beta_x, \tilde\beta_y, \tilde Z)$, which depends on $n$,
converges. However, it is immediate to show that the sequence
is tight, and hence we can restrict ourselves to a subsequence
$\Xi = \{n_i\}_{i\in\mathbb{N}}$ along which a limit exists. Eventually we
will show that the limit does not depend on the subsequence, apart,
possibly, from the rotation $Q_n$. Hence we shall denote the
subsequential limit, by an abuse of notation, as $(\beta_x, \beta_y, Z)$.

Consider now such a convergent subsequence. It is possible
to show that $\Sigma_i^2 > \sigma^2/p$ implies
$Z_{ii}^2 > \alpha^{3/2}\sigma^2 c_+(\alpha)^2 + \delta$
for some positive $\delta$. Since almost surely as $n \to \infty$,
$w_i^2/n^2 < \alpha^{3/2}\sigma^2 c_+(\alpha)^2 + \delta/2$ for all $i$,
for all purposes the summands
on the rhs of Eqs. (11), (12) can be replaced by uniformly
continuous, bounded functions of the limiting eigenvalues $\lambda_i$.
Further, each entry of $u_i$ (resp. $v_i$) is just a single coordinate
of the left (right) singular vectors of the random matrix $W$.
Using Theorem 1 in [12], it follows that any subsequential
limit satisfies the equations

$$\beta_x = \Sigma\beta_y\left\{ Z\int (Z^2 - \alpha^{3/2}\sigma^2\lambda)^{-1}\rho(\lambda)\,d\lambda + (\alpha - 1)Z^{-1} \right\}, \tag{13}$$

$$\beta_y = \Sigma\beta_x\left\{ Z\int (Z^2 - \alpha^{3/2}\sigma^2\lambda)^{-1}\rho(\lambda)\,d\lambda \right\}. \tag{14}$$



Solving for $\beta_y$, we get an equation of the form

$$\Sigma^{-2}\beta_y = \beta_y f(Z), \tag{15}$$

where $f(\cdot)$ is a function that can be given explicitly using
the Stieltjes transform of the measure $\rho(\lambda)d\lambda$. Equation (15)
implies that $\beta_y$ is block diagonal according to the degeneracy
pattern of $\Sigma$. Considering each block, either $\beta_y$ vanishes in
the block (a case that can be excluded using $\Sigma_i^2 > \sigma^2/p$)
or $\Sigma_i^{-2} = f(Z_{ii})$ in the block. Solving for $Z_{ii}$ shows that
the eigenvalues are uniquely determined (independent of the
subsequence) and given by Eq. (6).

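In order to determine $\beta_x$ and $\beta_y$ first observe that, since
$I_{r\times r} = Y^T Y = \sum_{i=1}^{n} y_i y_i^T$, we have, using Eq. (10),

$$I_{r\times r} = \sum_{i=1}^{n} (Z^2 - w_i^2)^{-1}(Z\beta_x^T v_i + w_i\beta_y^T u_i)(v_i^T\beta_x Z + w_i u_i^T\beta_y)(Z^2 - w_i^2)^{-1} .$$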

In the limit $n \to \infty$, and assuming a convergent subsequence
for $(Z, \beta_x, \beta_y)$, this sum can be computed as above. After



(Z 2 - crV 2 (7 2 A) 
o?I 2 g 2 \ 



— - P {\) d\}Cy, 



(Z 2 - a 3 / 2 o- 2 A) 

where $C_x = \beta_x^T\beta_x$, $C_y = \beta_y^T\beta_y$ and the functions
of $Z$ on the rhs are defined as standard analytic functions of matrices.

Using Eqs. (13), (14) and solving the above, we get
$C_x = \mathrm{diag}(\Sigma_1^2 a_1^2, \dots, \Sigma_r^2 a_r^2)$, and
$C_y = \mathrm{diag}(\Sigma_1^2 b_1^2, \dots, \Sigma_r^2 b_r^2)$. We
already concluded that $\beta_x$ and $\beta_y$ are block diagonal with
blocks in correspondence with the degeneracy pattern of $\Sigma$.
Since $\beta_x^T\beta_x = C_x$ and $\beta_y^T\beta_y = C_y$ are diagonal,
with the same degeneracy pattern, it follows that, inside each block of
size $d$, each of $\beta_x$ and $\beta_y$ is proportional to a $d \times d$
orthogonal matrix. Therefore $\beta_x = \Sigma A Q_\Xi$,
$\beta_y = \Sigma B Q'_\Xi$, for some orthogonal matrices $Q_\Xi$,
$Q'_\Xi$. Also, using equation (13) one can prove
that $Q_\Xi = Q'_\Xi$.

Notice that, by the above argument, $A$, $B$ are uniquely fixed by
our construction. On the other hand $Q_\Xi$ might depend on the
subsequence $\Xi$. Since our statement allows for a sequence
of rotations $Q_n$ that depend on $n$, the eventual subsequence
dependence of $Q_\Xi$ can be factored out. ■

It is useful to point out a straightforward consequence of 
the above. 

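Corollary III.3. There exists a sequence of orthogonal matrices
$Q_n \in \mathbb{R}^{r\times r}$ such that, almost surely,

$$\lim_{n\to\infty} \left\| \frac{1}{\sqrt{mn}}\, X^T U\Sigma V^T Y - Q_n D Q_n^T \right\|_F = 0, \tag{16}$$

with $D = \mathrm{diag}(\Sigma_1 a_1 b_1, \dots, \Sigma_r a_r b_r)$.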

B. The effect of missing entries 

The proof of Theorem I.1 is completed by establishing a
relation between the singular vectors $X_0$, $Y_0$ of $N^E$ and the
singular vectors $X$ and $Y$ of $\widetilde N$.



Lemma III.4. Let $k \le r$ be the largest integer such that
$\Sigma_1^2 \ge \dots \ge \Sigma_k^2 > \sigma^2/p$, and denote by
$X_0^{(k)}$, $Y_0^{(k)}$, $X^{(k)}$, and $Y^{(k)}$ the matrices containing
the first $k$ columns of $X_0$, $Y_0$, $X$, and $Y$, respectively. Let
$X_0^{(k)} = X^{(k)} S_x + X_\perp^{(k)}$,
$Y_0^{(k)} = Y^{(k)} S_y + Y_\perp^{(k)}$ where
$(X_\perp^{(k)})^T X^{(k)} = 0$, $(Y_\perp^{(k)})^T Y^{(k)} = 0$
and $S_x, S_y \in \mathbb{R}^{r\times r}$. Then there exists a numerical
constant $C = C(\Sigma_i, \sigma^2, \alpha, M_{\max})$, such that

WX^WlAlY^Wl^Cr^, (17) 

with probability approaching 1 as $n \to \infty$.

Proof: We will prove our claim for the right singular 
vector Y, since the left case is completely analogous. Further 
we will drop the superscript k to lighten the notation. 

We start by noticing that
$\|N^E Y_0\|_F^2 = \sum_{a=1}^{k} (n\tilde z_{a,n})^2$,
where $n\tilde z_{a,n}$ are the singular values of $N^E$. Using Lemma
3.2 in [2], which bounds $\|M^E - pM\|_2 = \|N^E - \widetilde N\|_2$, we
get

k 

\\N E Y \\% > ]T K, B - CM^pTif . (18) 

a=l 

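On the other hand
$\|N^E Y_0\|_F \le \|\widetilde N Y_0\|_F + \|N^E - \widetilde N\|_2\,\|Y_0\|_F$.
Further by letting $S_y = L_y\Theta_y R_y^T$, for $L_y$, $R_y$
orthogonal matrices, we get
$\|\widetilde N Y_0\|_F^2 = \|\widetilde N Y L_y\Theta_y\|_F^2 + \|\widetilde N Y_\perp\|_F^2$.
Since $Y_0^T Y_0 = I_{k\times k}$, we have
$I_{k\times k} = R_y\Theta_y^T\Theta_y R_y^T + Y_\perp^T Y_\perp$, and therefore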

\\NY \\ 2 F = \\NYL y \\ 2 F -\\NYL y R^Yl\\ 2 F + \\NY x \\ 2 F 

a=l 

+n 2 p<7 2 a(c+(a) + S)\\Y ± \\ 2 F 

k 

2 2 2 1 1 v 1 1 2 

= n 2^ Z a,n~ n ey\\Yj_\\ F , 

a=l 

where $e_y = z_k^2 - p\sigma^2\alpha(c_+(\alpha) + \delta)$, and we used the
inequality
$\|\widetilde N Y_\perp\|_F^2 \le n^2 p\sigma^2\alpha(c_+(\alpha) + \delta)\|Y_\perp\|_F^2$,
which holds for all $\delta > 0$ asymptotically almost surely as
$n \to \infty$ (by an immediate generalization of Lemma III.2). It is
simple to check that $\Sigma_k^2 > \sigma^2/p$ implies $e_y > 0$.

Using the triangle inequality and Lemma 3.2 in [2], we get

r 

\\NY \\% < n^zln-n'e^Y^l + Cnpa^Ml^r 

a=l 

+2Cn^a 3 / 4 M mllx V^\\z\\, 

which, combined with equation (18), implies the thesis. ■

Proof of Theorem I.1: We now turn to upper bounding
the right hand side of Eq. (5). Let $k$ be defined as in the
last lemma. Notice that by Lemma III.2 $X^T(U\Sigma V^T)Y$ is
well approximated by $(X^{(k)})^T(U\Sigma V^T)Y^{(k)}$. Analogously, it
can be proved that $X_0^T(U\Sigma V^T)Y_0$ is well approximated by
$(X_0^{(k)})^T(U\Sigma V^T)Y_0^{(k)}$. Due to space limitations, we will omit
this technical step and thus focus here on the case $k = r$
(equivalently, neglect the error incurred by this approximation).

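Using Lemma III.4 to bound the contribution of $X_\perp$, $Y_\perp$,
we have

$$\begin{aligned}
\langle X_0^T(U\Sigma V^T)Y_0,\; X_0^T N^E Y_0\rangle
&= \langle S_x^T X^T(U\Sigma V^T)Y S_y,\; X_0^T N^E Y_0\rangle\,(1 + o_n(1))\\
&= \langle X^T(U\Sigma V^T)Y,\; S_x X_0^T N^E Y_0 S_y^T\rangle\,(1 + o_n(1)).
\end{aligned} \tag{19}$$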

Further $X_0^T N^E Y_0 = X_0^T\widetilde N Y_0 + X_0^T(N^E - \widetilde N)Y_0$
and, using once more the bound in Lemma 3.2 of [2], that implies

\X£(N E - N)Y \ < Cr^/rvrp, we get 

S%X£ N E Y S y = L x Q 2 L^X T NYR y Q 2 R y + E x 
= Z + E 2 , 

where we recall that $Z$ is the diagonal matrix with entries
given by the singular values of $\widetilde N$, and
$\|E_1\|_F, \|E_2\|_F \le C(p, r)\sqrt{n}$. Using this estimate in
Eq. (19), together with the result in Lemma III.2, we finally get

$$\frac{\langle X_0^T(U\Sigma V^T)Y_0,\; X_0^T N^E Y_0\rangle}{\sqrt{mn}\,\|X_0^T N^E Y_0\|_F} = \frac{\sum_{k=1}^{r}\Sigma_k a_k b_k z_k}{\|z\|} + o_n(1),$$

which implies the thesis after simple algebraic manipulations.
■